In the previous lesson, we learned about Bigtable, which provides low latency and high performance at petabyte scale. But Bigtable is costly. So, what if you need something cost-effective for analyzing terabytes to petabytes of data with latency in seconds?

BigQuery is the right choice for that. We have already used BigQuery while creating billing exports. Let’s look at BigQuery (BQ) in detail in this lesson.

Introduction#

BigQuery is GCP’s serverless, highly scalable, and cost-effective cloud data warehouse.

It allows for super-fast queries at petabyte scale using the processing power of Google’s infrastructure. Because there’s no infrastructure for customers to manage, they can focus on uncovering meaningful insights using familiar SQL without the need for a database administrator.

It’s economical because you pay only for the processing and storage you use.

BigQuery is part of Google Cloud’s comprehensive data analytics platform that covers the entire analytics value chain including ingesting, processing and storing data, followed by advanced analytics and collaboration.

Architecture#

Creating a data warehouse traditionally requires a lot of money, along with significant administration and server management effort. BigQuery solves this problem with an innovative architecture.

BQ decouples storage and compute and allows them to scale independently on demand. This structure offers both immense flexibility and cost control for customers, because they don’t need to keep expensive compute resources up and running all the time. It allows us to ingest data of all sizes and types into BQ and start analyzing it using SQL.

BQ Architecture

BQ architecture has:

  • Dremel, the compute service: a large multi-tenant cluster that executes SQL queries.

  • Colossus, Google’s global storage system, which handles storage.

  • The petabit Jupiter network, through which compute and storage talk to each other.

  • Borg, Google’s precursor to Kubernetes, which orchestrates BigQuery.

This architecture makes BQ an effective serverless data warehouse and allows granular control.

Ingesting data into BigQuery#

BigQuery supports several ways to ingest data into its managed storage. The specific ingestion method depends on the origin of the data.

For example, some data sources in GCP, like Cloud Logging and Google Analytics, support direct exports to BigQuery.

BigQuery Data Transfer Service enables data transfer to BigQuery from Google SaaS apps (such as Google Ads), external storage (Cloud Storage, Amazon S3), and other data warehouses (Teradata, Redshift).

Streaming data, such as logs or IoT device data, can be written to BigQuery using Cloud Dataflow pipelines, Cloud Dataproc jobs, or directly using the BigQuery stream ingestion API. These are higher-level data pipeline architecture services. You can ignore these for now. Just remember the ways through which you can ingest data into BQ.
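As a sketch of the batch-ingestion path, the bq CLI can load files from Cloud Storage directly into a table. The dataset, table, and bucket names below are illustrative, and the commands assume an authenticated gcloud environment:

```shell
# Hypothetical names: mydataset.mytable and gs://my-bucket/data.csv.
# Load a CSV file from Cloud Storage into a BigQuery table,
# letting BigQuery autodetect the schema:
bq load --source_format=CSV --autodetect \
    mydataset.mytable gs://my-bucket/data.csv

# Recurring transfers (e.g., from Google Ads or Amazon S3) are set up
# through the BigQuery Data Transfer Service instead, via the console
# or `bq mk --transfer_config ...`.
```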

Accessing BigQuery#

You can access BigQuery in multiple ways:

  • Using the GCP console.
  • Using the command-line tool bq.
  • Making calls to the BigQuery REST API.
  • Using a variety of client libraries, such as Java, .NET, or Python.

We will see the first two methods in this course.
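As a taste of the client-library route, here is a minimal Python sketch using the google-cloud-bigquery library. It assumes the library is installed (`pip install google-cloud-bigquery`) and that application default credentials are configured, so it will not run without a GCP account:

```python
# Minimal sketch: run a query against a public dataset with the
# Python client library. Requires GCP credentials to actually run.
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project and credentials

query = """
    SELECT id, score
    FROM `bigquery-public-data.stackoverflow.posts_answers`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.id, row.score)
```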

Using GCP Console#

To access BQ,

  • Open Main menu > Big Data > BigQuery. GCP updates the UI frequently, so you might initially see a different UI, which will be a preview of an update. Click on the Hide Preview Features button to get back to the stable UI.

In Preview mode, products or features are ready for testing by customers. Preview offerings are often publicly announced but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for them. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months.

Click on the Hide Preview Features button if you see the button.
BQ Dashboard.

On the UI, we have:

  1. ADD DATA: Used to add datasets to the project.

  2. Project: A project contains datasets, and datasets contain tables. Expand the project in the left pane.

  3. Dataset: A collection of tables.

  4. Table: The last unit of the hierarchy. All the data lives in tables.

A BQ project is different from the GCP project. We can have multiple projects listed in BQ; a project is just a logical separation of datasets.

The right-side view shows resource-specific options. The options change as per the selected resource.

Next, we will add petabytes of data to our BQ console. BigQuery comes with many public datasets, which we can start using with a click. We will use the “Stack Overflow” dataset.

  1. Click on ADD DATA -> Explore public datasets.

  2. Search for stack overflow.

  3. Click on the dataset, and then click on “View Dataset”.

  4. This will pin the public datasets to the BQ console.

Click on Explore public datasets.
Search for stack overflow and click on the dataset.
Click on the View Dataset button to pin the dataset in the BQ dashboard.

Querying tables#

  1. Search for stack to filter the Stack Overflow dataset from all the other datasets.

  2. Click on the table name posts_answers.

  3. Click on “Query Table”.

  4. Complete the query in the query builder using the wildcard * for the column names.

  5. Look at the estimate of the data the query will process, shown even before the query is executed. This helps in estimating the cost of the query.

  6. Run the query. It will take only a few seconds to process ~25 GB of data (you might see a different number of GBs).

If you provide the column names instead, BQ will process only those columns. This also reduces the cost.
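To see why column pruning matters for cost, here is a back-of-the-envelope calculation. On-demand pricing is per byte scanned; the $5-per-TB rate and the ~4 GB two-column scan used below are illustrative assumptions only (check the current BigQuery pricing page):

```python
# Rough on-demand cost estimate from bytes processed.
# The $5-per-TB rate is an illustrative assumption, not current pricing.
TB = 1024 ** 4  # bytes in a (binary) terabyte

def estimated_cost_usd(bytes_processed, rate_per_tb=5.0):
    """Estimate query cost from the bytes the query will scan."""
    return bytes_processed / TB * rate_per_tb

# Scanning the full ~25 GB table vs. only two columns (~4 GB, assumed):
print(round(estimated_cost_usd(25 * 1024 ** 3), 4))  # ≈ 0.1221
print(round(estimated_cost_usd(4 * 1024 ** 3), 4))   # ≈ 0.0195
```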

Querying tables in BQ.

Using bq CLI#

  1. Open Cloud Shell.

  2. Type bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.stackoverflow.posts_answers` LIMIT 1000'

  3. The --dry_run flag is very important: it provides a cost estimate for the query without running it. Most organizations have policies to approve query costs, so knowing the amount of data that will be processed is crucial.

Using bq CLI.

Try running these commands with the help of the terminal given below. When you start the terminal, it will ask you to log in to your Google Cloud account. Once logged in, you can use the Google Cloud CLI to run bq commands.

Terminal 1

Use cases of BQ#

An enterprise will use BQ whenever they need to:

  • Have a 360⁰ view of their businesses: As the cost of storage and data processing reduces, enterprises want to process, store and analyze all relevant datasets, both internal and external to their organization.

  • Be situationally aware of and responsive to real-time business events: Enterprises need to gain insights from real-time events rather than waiting days or weeks to analyze data. The data warehouse needs to reflect the present state of the business at all times.

  • Reduce time to insights: Enterprises need to get up and running fast, without waiting days or months for hardware or software to be installed or configured.

  • Make insights available to business users to enable data-driven decision-making across the enterprise: In order to embrace a data-driven culture, enterprises need to democratize access to data.

  • Secure their data and govern its use: Data needs to be secure and accessible to the right stakeholders inside and outside of the enterprise.

Partitioning and Clustering#

BigQuery is known for its cost-saving and efficient data-retrieval mechanisms. Partitioning and clustering are two of them. Apart from these, BQ also uses query caching to reduce data processing.

With partitioning, data is grouped on the basis of columns frequently used in the WHERE clause, so processing is done only in the relevant partition. Most of the time, a date or month is used for partitioning.

Clustering sorts the data by frequently accessed columns, which makes filtering and nested searches more efficient.
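As a sketch, BigQuery’s standard SQL DDL lets you declare both at table-creation time. The destination table name below is hypothetical, and the creation_date and tags columns are assumed to exist in the source table:

```sql
-- Hypothetical table: partitioned by day, clustered by tags.
-- Queries filtering on DATE(creation_date) scan only the matching
-- partitions; within each partition, rows are sorted by tags.
CREATE TABLE mydataset.answers_by_day
PARTITION BY DATE(creation_date)
CLUSTER BY tags AS
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_answers`;
```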

To know these concepts in detail complete the following short lab on BQ.

Also, as an exam tip: whenever you encounter the words “database”, “terabytes/petabyte scale”, and “analytics” together, BQ is the answer.

This was a short introduction to BigQuery. BQ has many more features that are out of scope for this course. However, what we have learned is more than enough for the exam.

In the next lesson, we will look at Cloud SQL, which is a fully managed RDBMS service for MySQL, SQL Server, and PostgreSQL.
